25 research outputs found
Pixelwise Instance Segmentation with a Dynamically Instantiated Network
Semantic segmentation and object detection research have recently achieved
rapid progress. However, the former task has no notion of different instances
of the same object, and the latter operates at a coarse, bounding-box level. We
propose an Instance Segmentation system that produces a segmentation map where
each pixel is assigned an object class and instance identity label. Most
approaches adapt object detectors to produce segments instead of boxes. In
contrast, our method is based on an initial semantic segmentation module, which
feeds into an instance subnetwork. This subnetwork uses the initial
category-level segmentation, along with cues from the output of an object
detector, within an end-to-end CRF to predict instances. This part of our model
is dynamically instantiated to produce a variable number of instances per
image. Our end-to-end approach requires no post-processing and considers the
image holistically, instead of processing independent proposals. Therefore,
unlike some related work, a pixel cannot belong to multiple instances.
Furthermore, far more precise segmentations are achieved, as shown by our
state-of-the-art results (particularly at high IoU thresholds) on the Pascal
VOC and Cityscapes datasets.
Comment: CVPR 201
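The instance assignment described in this abstract can be illustrated with a unary scoring rule: each pixel is claimed by the detection (weighted by its confidence) that best explains it, falling back to background where the category-level segmentation is weak. The sketch below is a toy stand-in for the paper's instance CRF, not its implementation; the array shapes and the 0.5 background threshold are assumptions.

```python
import numpy as np

def assign_instances(seg_probs, det_masks, det_scores):
    # seg_probs: (H, W) category-level probability from the initial
    # semantic segmentation module.
    # det_masks: (K, H, W) soft support of each of K detections.
    # det_scores: (K,) detector confidences.
    # Simplified stand-in for the instance CRF's unary term: each pixel
    # takes the instance whose detection best explains it; pixels with
    # weak category evidence become background (id 0). As in the paper,
    # each pixel ends up in at most one instance.
    scores = det_masks * det_scores[:, None, None] * seg_probs[None]
    inst = scores.argmax(axis=0) + 1            # instance ids 1..K
    return np.where(seg_probs > 0.5, inst, 0)   # assumed 0.5 threshold
```

Note that because the per-pixel argmax is exclusive, a pixel cannot belong to two instances, unlike proposal-based methods that score overlapping segments independently.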
Holistic, Instance-Level Human Parsing
Object parsing -- the task of decomposing an object into its semantic parts
-- has traditionally been formulated as a category-level segmentation problem.
Consequently, when there are multiple objects in an image, current methods
cannot count the number of objects in the scene, nor can they determine which
part belongs to which object. We address this problem by segmenting the parts
of objects at an instance-level, such that each pixel in the image is assigned
a part label, as well as the identity of the object it belongs to. Moreover, we
show how this approach benefits us in obtaining segmentations at coarser
granularities as well. Our proposed network is trained end-to-end given
detections, and begins with a category-level segmentation module. Thereafter, a
differentiable Conditional Random Field, defined over a variable number of
instances for every input image, reasons about the identity of each part by
associating it with a human detection. In contrast to other approaches, our
method can handle the varying number of people in each image and our holistic
network produces state-of-the-art results in instance-level part and human
segmentation, together with competitive results in category-level part
segmentation, all achieved by a single forward-pass through our neural network.
Comment: Poster at BMVC 201
A Projected Gradient Descent Method for CRF Inference allowing End-To-End Training of Arbitrary Pairwise Potentials
Are we using the right potential functions in the Conditional Random Field
models that are popular in the Vision community? Semantic segmentation and
other pixel-level labelling tasks have made significant progress recently due
to the deep learning paradigm. However, most state-of-the-art structured
prediction methods also include a random field model with a hand-crafted
Gaussian potential to model spatial priors, label consistencies and
feature-based image conditioning.
In this paper, we challenge this view by developing a new inference and
learning framework which can learn pairwise CRF potentials restricted only by
their dependence on the image pixel values and the size of the support. Both
standard spatial and high-dimensional bilateral kernels are considered. Our
framework is based on the observation that CRF inference can be achieved via
projected gradient descent and consequently, can easily be integrated in deep
neural networks to allow for end-to-end training. It is empirically
demonstrated that such learned potentials can improve segmentation accuracy and
that certain label class interactions are indeed better modelled by a
non-Gaussian potential. In addition, we compare our inference method to the
commonly used mean-field algorithm. Our framework is evaluated on several
public benchmarks for semantic segmentation with improved performance compared
to previous state-of-the-art CNN+CRF models.
Comment: Presented at the EMMCVPR 2017 conference
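The core observation — that relaxed CRF inference is itself a projected gradient descent, and hence can be unrolled and trained end-to-end — can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's learned-kernel implementation: the pairwise term here is a single label-compatibility matrix applied uniformly over a fully-connected graph, and the step size and iteration count are arbitrary.

```python
import numpy as np

def project_simplex(v):
    # Euclidean projection of each row of v onto the probability
    # simplex, via the standard sort-and-threshold algorithm.
    n, d = v.shape
    u = np.sort(v, axis=1)[:, ::-1]
    css = np.cumsum(u, axis=1) - 1.0
    ind = np.arange(1, d + 1)
    rho = (u - css / ind > 0).sum(axis=1)
    theta = css[np.arange(n), rho - 1] / rho
    return np.maximum(v - theta[:, None], 0.0)

def crf_pgd(unary, pairwise, steps=50, lr=0.1):
    # Relaxed CRF inference over per-pixel label marginals Q:
    # minimise E(Q) = <unary, Q> + sum_{i!=j} Q_i^T pairwise Q_j
    # by gradient steps followed by projection onto the simplex.
    # unary: (N, L) costs; pairwise: (L, L) label compatibility
    # (a stand-in for the learned, image-conditioned kernels).
    N, L = unary.shape
    Q = np.full((N, L), 1.0 / L)
    for _ in range(steps):
        # Messages from all other pixels under a uniform kernel
        # (fully-connected graph, purely for illustration).
        msg = (Q.sum(axis=0, keepdims=True) - Q) @ pairwise
        Q = project_simplex(Q - lr * (unary + msg))
    return Q.argmax(axis=1)
```

Because every operation in the loop is differentiable almost everywhere, the same iterations can be stacked as layers in a network and the pairwise parameters learned by backpropagation, which is the mechanism the paper exploits.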
How can objects help action recognition?
Current state-of-the-art video models process a video clip as a long sequence
of spatio-temporal tokens. However, they do not explicitly model objects or
their interactions across the video, and instead process all the tokens in the
video.
In this paper, we investigate how we can use knowledge of objects to design
better video models, namely to process fewer tokens and to improve recognition
accuracy. This is in contrast to prior works which either drop tokens at the
cost of accuracy, or increase accuracy whilst also increasing the computation
required. First, we propose an object-guided token sampling strategy that
enables us to retain a small fraction of the input tokens with minimal impact
on accuracy. And second, we propose an object-aware attention module that
enriches our feature representation with object information and improves
overall accuracy. Our resulting framework achieves better performance when
using fewer tokens than strong baselines. In particular, we match our baseline
with 30%, 40%, and 60% of the input tokens on SomethingElse,
Something-something v2, and Epic-Kitchens, respectively. When we use our model
to process the same number of tokens as our baseline, we improve by 0.6 to 4.2
points on these datasets.
Comment: CVPR 202
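A minimal version of the object-guided sampling idea — keep a fixed token budget, but spend it on tokens that fall inside detected objects first — might look like the following. The box format, the fill-in order for leftover tokens, and the keep ratio are all assumptions; in the paper the sampling strategy is designed jointly with the video model.

```python
import numpy as np

def object_guided_sampling(tokens, centers, boxes, keep_ratio=0.3):
    # tokens: (N, D) patch features; centers: (N, 2) patch centres in
    # [0, 1] x [0, 1]; boxes: (M, 4) object boxes as (x1, y1, x2, y2).
    # Keep tokens whose centres lie inside a detected object first,
    # then fill any remaining budget with the leftover tokens in
    # their original order.
    N = tokens.shape[0]
    k = max(1, int(round(keep_ratio * N)))
    inside = np.zeros(N, dtype=bool)
    for x1, y1, x2, y2 in boxes:
        inside |= ((centers[:, 0] >= x1) & (centers[:, 0] <= x2) &
                   (centers[:, 1] >= y1) & (centers[:, 1] <= y2))
    order = np.concatenate([np.flatnonzero(inside), np.flatnonzero(~inside)])
    keep = np.sort(order[:k])                 # restore temporal order
    return tokens[keep], keep
```

With `keep_ratio=0.3` this discards 70% of the input tokens up front, which is where the computational savings reported in the abstract come from.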
Dense Video Object Captioning from Disjoint Supervision
We propose a new task and model for dense video object captioning --
detecting, tracking, and captioning trajectories of all objects in a video.
This task unifies spatial and temporal understanding of the video, and requires
fine-grained language description. Our model for dense video object captioning
is trained end-to-end and consists of different modules for spatial
localization, tracking, and captioning. As such, we can train our model with a
mixture of disjoint tasks, and leverage diverse, large-scale datasets which
supervise different parts of our model. This results in noteworthy zero-shot
performance. Moreover, by finetuning a model from this initialization, we can
further improve our performance, surpassing strong image-based baselines by a
significant margin. Although we are not aware of other work performing this
task, we are able to repurpose existing video grounding datasets for our task,
namely VidSTG and VLN. We show our task is more general than grounding, and
models trained on our task can directly be applied to grounding by finding the
bounding box with the maximum likelihood of generating the query sentence. Our
model outperforms dedicated, state-of-the-art models for spatial grounding on
both VidSTG and VLN.
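The grounding-by-captioning reduction described above is simple to state: score each candidate trajectory by the log-likelihood the caption head assigns to the query sentence, and return the highest-scoring one. A sketch with a hypothetical `log_prob` interface — the real model conditions on visual trajectory features rather than an opaque handle:

```python
import math

def ground_query(query_tokens, trajectories, log_prob):
    # trajectories: candidate box tracks; log_prob(token, history, traj)
    # stands in for the captioning head's next-token log-probability
    # conditioned on a trajectory's features (assumed interface).
    best, best_score = None, -math.inf
    for traj in trajectories:
        score, history = 0.0, []
        for tok in query_tokens:
            # Sum teacher-forced log-probabilities of the query tokens.
            score += log_prob(tok, tuple(history), traj)
            history.append(tok)
        if score > best_score:
            best, best_score = traj, score
    return best
```

Nothing here is specific to grounding, which is why a model trained only on the captioning task can be applied to grounding benchmarks without further training.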
Dynamic Graph Message Passing Networks
Modelling long-range dependencies is critical for complex scene understanding
tasks such as semantic segmentation and object detection. Although CNNs have
excelled in many computer vision tasks, they are still limited in capturing
long-range structured relationships as they typically consist of layers of
local kernels. A fully-connected graph is beneficial for such modelling,
however, its computational overhead is prohibitive. We propose a dynamic graph
message passing network, based on the message passing neural network framework,
that significantly reduces the computational complexity compared to related
works modelling a fully-connected graph. This is achieved by adaptively
sampling nodes in the graph, conditioned on the input, for message passing.
Based on the sampled nodes, we then dynamically predict node-dependent filter
weights and the affinity matrix for propagating information between them. Using
this model, we show significant improvements with respect to strong,
state-of-the-art baselines on three different tasks and backbone architectures.
Our approach also outperforms fully-connected graphs while using substantially
fewer floating point operations and parameters.
Comment: CVPR 2020 Oral
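The adaptive-sampling idea can be illustrated with a non-learned stand-in: pick each node's k most similar nodes as its sampled neighbourhood, normalise affinities over that neighbourhood only, and aggregate — O(Nk) messages rather than O(N²) for the fully-connected graph. The similarity-based sampler and the residual update are assumptions made for this sketch; in the paper both the sampling locations and the filter weights are predicted by the network, conditioned on the input.

```python
import numpy as np

def dynamic_message_passing(feats, k=4):
    # feats: (N, D) node features, e.g. the pixels of a feature map.
    # 1) Adaptively sample k neighbours per node (here: most similar
    #    by dot product -- a stand-in for the learned sampler).
    N, _ = feats.shape
    sim = feats @ feats.T
    np.fill_diagonal(sim, -np.inf)                 # no self-messages
    idx = np.argpartition(-sim, k, axis=1)[:, :k]
    # 2) Node-dependent affinities, normalised over the sampled set
    #    only, so the cost is O(N*k) instead of O(N^2).
    a = sim[np.arange(N)[:, None], idx]
    a = np.exp(a - a.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)
    # 3) Aggregate messages from the sampled neighbours; residual update.
    msgs = (a[:, :, None] * feats[idx]).sum(axis=1)
    return feats + msgs
```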
Adaptive Computation with Elastic Input Sequence
Humans have the ability to adapt the type of information they use, the
procedure they employ, and the amount of time they spend when solving problems.
However, most standard neural networks have a fixed function type and
computation budget regardless of the sample's nature or difficulty. Adaptivity
is a powerful paradigm as it not only imbues practitioners with flexibility
pertaining to the downstream usage of these models but can also serve as a
powerful inductive bias for solving certain challenging classes of problems. In
this work, we introduce a new approach called AdaTape, which allows for dynamic
computation in neural networks through adaptive tape tokens. AdaTape utilizes
an elastic input sequence by equipping an architecture with a dynamic
read-and-write tape. Specifically, we adaptively generate input sequences using
tape tokens obtained from a tape bank which can be either trainable or derived
from input data. We examine the challenges and requirements to obtain dynamic
sequence content and length, and propose the Adaptive Tape Reading (ATR)
algorithm to achieve both goals. Through extensive experiments on image
recognition tasks, we show that AdaTape achieves better performance at a
comparable computational cost. To facilitate further research, we have
released our code at https://github.com/google-research/scenic.
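A greatly simplified version of the adaptive reading loop might work as follows: repeatedly query the tape bank with a summary of the sequence so far, append the best-matching tape token, and halt once the match drops below a threshold, so that both the content and the length of the appended sequence adapt to the input. The mean-pooled query, the dot-product match, and the threshold `tau` are assumptions for this sketch, not the ATR algorithm itself.

```python
import numpy as np

def adaptive_tape_read(x, bank, max_reads=8, tau=0.5):
    # x: (D,) input representation; bank: (B, D) tape-token bank
    # (trainable or input-derived in the paper).
    # Greedily append the bank token most aligned with the current
    # query; stop early once the best alignment falls below tau.
    seq = [x]
    used = np.zeros(len(bank), dtype=bool)
    for _ in range(max_reads):
        query = np.mean(seq, axis=0)       # summary of sequence so far
        scores = bank @ query
        scores[used] = -np.inf             # read each token at most once
        j = int(scores.argmax())
        if scores[j] < tau:
            break                          # adaptive halting
        used[j] = True
        seq.append(bank[j])
    return np.stack(seq)
```

Easy inputs halt after few reads and hard inputs consume more of the tape, which is the sense in which the computation budget becomes elastic.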